Summarizing and Mining Skewed Data Streams
Authors
Abstract
Many applications generate massive data streams. Summarizing such massive data requires fast, small space algorithms to support post-hoc queries and mining. An important observation is that such streams are rarely uniform, and real data sources typically exhibit significant skewness. These are well modeled by Zipf distributions, which are characterized by a parameter, z, that captures the amount of skew. We present a data stream summary that can answer point queries with ε accuracy and show that the space needed is only O(ε^(−min{1, 1/z})). This is the first o(1/ε) space algorithm for this problem, and we show it is essentially tight for skewed distributions. We show that the same data structure can also estimate the L2 norm of the stream in o(1/ε) space for z > 1/2, another improvement over the existing Ω(1/ε) methods. We support our theoretical results with an experimental study over a large variety of real and synthetic data. We show that significant skew is present in both textual and telecommunication data. Our methods give strong accuracy, significantly better than other methods, and behave exactly in line with their analytic bounds.
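The abstract does not name the summary it analyzes, so the following is only a minimal sketch, assuming a Count-Min-style summary (a small array of counters indexed by hash functions) as one common way to answer point queries over a stream. The class name `CountMinSketch`, the helper `zipf_counts`, the hash family, and the parameters `width`, `depth`, and `seed` are all illustrative assumptions and are not taken from the paper. The example makes no attempt to reproduce the space bound stated above.

```python
import random


class CountMinSketch:
    """Illustrative Count-Min-style summary: `depth` rows of `width` counters."""

    PRIME = (1 << 61) - 1  # modulus for the (assumed) linear hash family

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)  # fixed seed keeps the example reproducible
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]
        # One (a, b) pair per row: h(x) = ((a*x + b) mod PRIME) mod width
        self.hashes = [
            (rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
            for _ in range(depth)
        ]

    def _index(self, row, item):
        a, b = self.hashes[row]
        return ((a * item + b) % self.PRIME) % self.width

    def update(self, item, count=1):
        """Add `count` occurrences of integer `item` to every row."""
        for r in range(self.depth):
            self.rows[r][self._index(r, item)] += count

    def query(self, item):
        """Point query: the minimum counter over all rows.

        With non-negative updates this never underestimates the true count.
        """
        return min(self.rows[r][self._index(r, item)] for r in range(self.depth))


def zipf_counts(n_items, z, total):
    """Deterministic Zipf-like frequencies: rank i gets about total * i^(-z) / norm."""
    weights = [i ** -z for i in range(1, n_items + 1)]
    norm = sum(weights)
    return [int(total * w / norm) for w in weights]


if __name__ == "__main__":
    counts = zipf_counts(n_items=1000, z=1.5, total=100_000)
    cms = CountMinSketch(width=64, depth=4)
    for item, c in enumerate(counts):
        cms.update(item, c)
    for item in (0, 1, 10, 100):
        print(item, counts[item], cms.query(item))
```

Under skew, a few heavy items account for most of the stream, so collisions in each row mostly involve light items; that intuition is what makes smaller summaries plausible for large z, though the exact bound is the paper's result and is not demonstrated here.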
Similar resources
Mining Data Streams with Skewed Distribution based on Ensemble Method
In recent years, there have been some interesting studies on predictive modeling in data streams. However, most such studies assume relatively balanced and stable data streams but cannot handle well skewed (e.g., few positives but lots of negatives) and skewed distributions, which are typical in many data stream applications. In this paper, we propose an ensemble and cluster based sample method...
Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling
Emerging data stream management systems approach the challenge of massive data distributions which arrive at high speeds while there is only small storage by summarizing and mining the distributions using samples or sketches. However, data distributions can be “viewed” in different ways. A data stream of integer values can be viewed either as the forward distribution f(x), i.e., the number of oc...
Algorithmic Techniques for Processing Data Streams
We give a survey of some algorithmic techniques for processing data streams. After covering the basic methods of sampling and sketching, we present more evolved procedures that rely on those basic ones. In particular, we examine algorithmic schemes for similarity mining, the concept of group testing, and techniques for clustering and summarizing data streams. 1998 ACM Subject Classification F...
An Hybrid Data Stream Summarizing Approach by Sampling and Clustering
Computer systems generate a large amount of data that, in terms of space and time, is very expensive even impossible to store. Besides this, many applications need to keep an historical view of such data in order to provide historical aggregated information, perform data mining tasks or detect anomalous behavior in the computer systems. One solution is to treat the data as streams that can be p...
Min-wise independent sampling from skewed data streams
Min-wise independent hashing is a powerful sampling technique for estimating the similarity between sets. In particular, it has proved to be ubiquitous for mining data streams of large volume where the input sets are revealed in arbitrary order and the elements in a given set do not arrive consecutively. More precisely, for sets of elements E and attributes A the input is a stream of element-at...
Journal title:
Volume, issue:
Pages: -
Publication date: 2005